BigQueryにおける「標準SQL」と「レガシーSQL」の立ち位置について把握する

#SQL

#Google BigQuery

しんや

2020.03.14

この記事は公開されてから1年以上経過しています。情報が古い可能性がありますので、ご注意ください。

Google BigQueryでは、サービス内で利用出来るSQLとして「標準SQL(Standard SQL)」と「レガシーSQL(Legacy SQL)」という2種類のSQLが存在します。これらはそれぞれ、どういう特性や特徴があるのでしょうか。また両者の違いはや使い分け等も気になるところです。当エントリでは、それらを含めた両SQLの「立ち位置」について確認してみたいと思います。

BigQueryで利用可能なSQL：「標準SQL」と「レガシーSQL」

「標準SQL」とは、Google BigQueryのバージョン2.0から導入されたSQLの事です。

一方、BigQuery バージョン2.0を以て、それまで利用されていたSQLは「レガシーSQL」という扱いとなりました。

レガシーSQLリファレンス | BigQuery | Google Cloud

なお、基本的にはこの「標準SQL」を使うことが推奨とされています。(必須では無い)

...ということなんで、BigQueryをこれから使い始めるという場合であれば「そのまま標準SQLを使ってくれよな！以上！！」でおしまいなのですが、せっかくなので標準SQLでは何がどう変わったのか、標準SQLとレガシーSQLの実践例を幾つか実際に試してみたいと思います。

「標準SQL」と「レガシーSQL」の違いについて

「標準SQL」では「レガシーSQL」とは何が異なっているのか、という部分については以下のドキュメントにその内容がまとまっています。

下記は「利点」として、標準SQLがどの様な(レガシーSQLには無い)ものを持っているかを端的に説明した文言の引用です。(※その他型、構文、関数等についても違いについて言及されています。)

標準 SQL は SQL 2011 標準に準拠し、ネストされ繰り返されたデータのクエリをサポートする拡張機能を備えています。標準 SQL には、レガシー SQL に勝る次のような利点があります。

・WITH 句と SQL 関数を使用したコンポーザビリティ
・SELECT リストと WHERE 句内のサブクエリ
・相関サブクエリ
・ARRAY データ型と STRUCT データ型
・挿入、更新、削除
・COUNT(DISTINCT ) は正確でスケーラブルであり、EXACT_COUNT_DISTINCT の精度をその制限なしで提供
・JOIN による自動述語プッシュダウン
・任意の表現を含む複雑な JOIN 述語

標準 SQL への移行 | BigQuery | Google Cloud

ドキュメントでは、それぞれの手法で「標準SQL」「レガシーSQL」の利用における設定やコードサンプルが紹介されています。それに倣い、幾つか試してみたいと思います。

コンソール経由の場合

コンソール経由で実行する場合、実行クエリの#standardSQL,#legacySQLをプレフィックスで付けることで実行出来ます。

なお、設定でデフォルト挙動のSQLとしてどちらを使うかは切替可能です。クエリエディタで[その他]→[クエリ設定]のメニューを開き、

SQL言語の設定を切り替えることで対応出来ます。(※初期設定は標準SQL)

CLIコマンド(bq query)の場合

CLIコマンドの場合、bq queryコマンドでクエリを実行出来ます。この際、オプション指定で--use_legacy_sql=falseまたは--nouse_legacy_sqlを添えることで、そのクエリが標準SQLで実行されるようになります。

$ bq query \
--use_legacy_sql=false \
'SELECT
  name_alias,
  appearances,
  gender,
  year,
  years_since_joining
FROM
  cmbqdataset.marvel_characters
ORDER BY
  appearances DESC
LIMIT 20'
Waiting on bqjob_xxxxxxxxxxxxxxxxx_1 ... (0s) Current status: DONE   
+--------------------------------+-------------+--------+------+---------------------+
|           name_alias           | appearances | gender | year | years_since_joining |
+--------------------------------+-------------+--------+------+---------------------+
| peter benjamin parker          |        4333 | male   | 1990 |                  25 |
| steven rogers                  |        3458 | male   | 1964 |                  51 |
| james "logan" howlett          |        3130 | male   | 2005 |                  10 |
| anthony edward "tony" stark    |        3068 | male   | 1963 |                  52 |
| thor odinson                   |        2402 | male   | 1963 |                  52 |
| benjamin jacob grimm           |        2305 | male   | 1986 |                  29 |
| reed richards                  |        2125 | male   | 1989 |                  26 |
| robert bruce banner            |        2089 | male   | 1963 |                  52 |
| henry p. mccoy                 |        1886 | male   | 1976 |                  39 |
| susan richards (nee storm)     |        1761 | male   | 1989 |                  26 |
| ororo munroe                   |        1598 | female | 2012 |                   3 |
| namor mckenzie                 |        1561 | male   | 1985 |                  30 |
| clinton francis barton         |        1456 | male   | 1965 |                  50 |
| matt murdock                   |        1375 | male   | 2011 |                   4 |
| doctor stephen vincent strange |        1324 | male   | 2007 |                   8 |
| henry jonathan "hank" pym      |        1269 | male   | 1963 |                  52 |
| wanda maximoff                 |        1214 | female | 1965 |                  50 |
| janet van dyne                 |        1165 | female | 1963 |                  52 |
| natalia alianovna romanova     |        1112 | female | 1973 |                  42 |
| victor shade (alias)           |        1036 | male   | 1968 |                  47 |
+--------------------------------+-------------+--------+------+---------------------+

逆に、--use_legacy_sql=trueまたは--use_legacy_sqlと指定することで、そのクエリをレガシーSQLとして実行出来ます。上記のクエリをこのオプション指定で実行すると(レガシーSQLの利用制限には引っ掛からなかったので)結果は変わらず同じものが表示されますが、対応していないものにクエリ実行を仕掛けると、レガシーSQLを実行している事が分かる(エラー)メッセージが表示されます。

$ bq query \
--use_legacy_sql \
'SELECT * FROM INFORMATION_SCHEMA.SCHEMATA'
Error in query string: Error processing job 'cm-xxxxxxxxxxxxx:bqjob_xxxxxxxxxxxxxxxxxx_1':
Table cm-xxxxxxxxxxxxx:INFORMATION_SCHEMA.SCHEMATA is not queryable with legacy SQL.

クライアントライブラリ(Python)の場合

Pythonを介した標準SQLの利用例が以下。

# TODO(developer): Import the client library.
from google.cloud import bigquery

# TODO(developer): Construct a BigQuery client object.
client = bigquery.Client()

query = """
SELECT
  name_alias,
  appearances,
  gender,
  year,
  years_since_joining
FROM
  cmbqdataset.marvel_characters
ORDER BY
  appearances DESC
LIMIT 20
"""
query_job = client.query(query)  # Make an API request.

print("The query data:")
for row in query_job:
    # Row values can be accessed by field name or index.
    #print("name={}, count={}".format(row[0], row["total_people"]))
    print(row)

実行結果：

$ time python exec-bigquery-standard-sql.py 
The query data:
Row(('peter benjamin parker', 4333, 'male', 1990, 25), {'name_alias': 0, 'appearances': 1, 'gender': 2, 'year': 3, 'years_since_joining': 4})
Row(('steven rogers', 3458, 'male', 1964, 51), {'name_alias': 0, 'appearances': 1, 'gender': 2, 'year': 3, 'years_since_joining': 4})
Row(('james "logan" howlett', 3130, 'male', 2005, 10), {'name_alias': 0, 'appearances': 1, 'gender': 2, 'year': 3, 'years_since_joining': 4})
Row(('anthony edward "tony" stark', 3068, 'male', 1963, 52), {'name_alias': 0, 'appearances': 1, 'gender': 2, 'year': 3, 'years_since_joining': 4})
Row(('thor odinson', 2402, 'male', 1963, 52), {'name_alias': 0, 'appearances': 1, 'gender': 2, 'year': 3, 'years_since_joining': 4})
Row(('benjamin jacob grimm', 2305, 'male', 1986, 29), {'name_alias': 0, 'appearances': 1, 'gender': 2, 'year': 3, 'years_since_joining': 4})
Row(('reed richards', 2125, 'male', 1989, 26), {'name_alias': 0, 'appearances': 1, 'gender': 2, 'year': 3, 'years_since_joining': 4})
Row(('robert bruce banner', 2089, 'male', 1963, 52), {'name_alias': 0, 'appearances': 1, 'gender': 2, 'year': 3, 'years_since_joining': 4})
Row(('henry p. mccoy', 1886, 'male', 1976, 39), {'name_alias': 0, 'appearances': 1, 'gender': 2, 'year': 3, 'years_since_joining': 4})
Row(('susan richards (nee storm)', 1761, 'male', 1989, 26), {'name_alias': 0, 'appearances': 1, 'gender': 2, 'year': 3, 'years_since_joining': 4})
Row(('ororo munroe', 1598, 'female', 2012, 3), {'name_alias': 0, 'appearances': 1, 'gender': 2, 'year': 3, 'years_since_joining': 4})
Row(('namor mckenzie', 1561, 'male', 1985, 30), {'name_alias': 0, 'appearances': 1, 'gender': 2, 'year': 3, 'years_since_joining': 4})
Row(('clinton francis barton', 1456, 'male', 1965, 50), {'name_alias': 0, 'appearances': 1, 'gender': 2, 'year': 3, 'years_since_joining': 4})
Row(('matt murdock', 1375, 'male', 2011, 4), {'name_alias': 0, 'appearances': 1, 'gender': 2, 'year': 3, 'years_since_joining': 4})
Row(('doctor stephen vincent strange', 1324, 'male', 2007, 8), {'name_alias': 0, 'appearances': 1, 'gender': 2, 'year': 3, 'years_since_joining': 4})
Row(('henry jonathan "hank" pym', 1269, 'male', 1963, 52), {'name_alias': 0, 'appearances': 1, 'gender': 2, 'year': 3, 'years_since_joining': 4})
Row(('wanda maximoff', 1214, 'female', 1965, 50), {'name_alias': 0, 'appearances': 1, 'gender': 2, 'year': 3, 'years_since_joining': 4})
Row(('janet van dyne', 1165, 'female', 1963, 52), {'name_alias': 0, 'appearances': 1, 'gender': 2, 'year': 3, 'years_since_joining': 4})
Row(('natalia alianovna romanova', 1112, 'female', 1973, 42), {'name_alias': 0, 'appearances': 1, 'gender': 2, 'year': 3, 'years_since_joining': 4})
Row(('victor shade (alias)', 1036, 'male', 1968, 47), {'name_alias': 0, 'appearances': 1, 'gender': 2, 'year': 3, 'years_since_joining': 4})

一方、レガシーSQLの例が以下。use_legacy_sql=Trueという設定を有効とすることでレガシーSQLを実行しています。(※出力結果は割愛)

from google.cloud import bigquery

# TODO(developer): Construct a BigQuery client object.
client = bigquery.Client()

query = (
"""
SELECT
  name_alias,
  appearances,
  gender,
  year,
  years_since_joining
FROM
  cmbqdataset.marvel_characters
ORDER BY
  appearances DESC
LIMIT 20"""
)

# Set use_legacy_sql to True to use legacy SQL syntax.
job_config = bigquery.QueryJobConfig(use_legacy_sql=True)

# Start the query, passing in the extra configuration.
query_job = client.query(query, job_config=job_config)  # Make an API request.

print("The query data:")
for row in query_job:
    print(row)

レガシーSQLで無ければならないケースはあるのか？

大きな方向性としては「無さそう」な気もしますが、下記ドキュメントで紹介されている「違い」や「比較」の項では、"レガシーSQLでは出来ているが、標準SQLでは対応していない、出来ない」ものも幾つか存在するようです。それらに合致する場合であれば「レガシーSQLで実装・実行せざるを得ない」というところでしょうか。可能な限り標準SQLで実装すべきなのでしょうが、前述の通り「標準SQLが必須」という訳でもないので、この辺は内容に応じて柔軟に対応...という形になる気もします。